Method for video matting via sparse and low-rank representation

ABSTRACT

The present invention provides a method for video matting via sparse and low-rank representation. The method first selects frames that represent the video characteristics of an input video as keyframes, then trains a dictionary according to the known pixels in the keyframes. Next, it obtains a reconstruction coefficient that satisfies low-rank, sparse and non-negative constraints according to the dictionary, sets a non-local relationship matrix between the pixels of the input video according to the reconstruction coefficient, and sets a Laplace matrix between multiple frames. A video alpha matte of the input video is then obtained according to the α values of the known pixels of the input video, the α values of the sample points in the dictionary, the non-local relationship matrix and the Laplace matrix. Finally, a foreground object in the input video is extracted according to the video alpha matte, thereby improving the quality of the extracted foreground object.

CROSS-REFERENCE TO RELATED APPLICATIONS

This application claims priority to Chinese Patent Application No. 201510695505.9, filed on Oct. 23, 2015, the content of which is hereby incorporated by reference in its entirety.

TECHNICAL FIELD

The present invention relates to image processing technologies and, in particular, relates to a method for video matting via sparse and low-rank representation.

BACKGROUND

Video matting aims at extracting a moving foreground object while ensuring good temporal and spatial consistency. As an important technical problem in the field of computer vision, video matting is widely used in fields such as hair modeling and defogging. In recent years, many matting methods have been proposed to extract high-quality foreground objects from complex videos and images.

Since sparse representation is widely used in the fields of face recognition, image classification, image restoration, video denoising, etc., Jubin et al. proposed an image matting method based on sparse representation, which reconstructs an original image with the foreground pixels of a whole video and estimates the opacity α (alpha) values of pixels according to the sum of the coefficients corresponding to each pixel in a sparse representation coefficient matrix. The method can automatically select appropriate sample points to reconstruct the original image; however, it fails to guarantee similar α values for pixels possessing similar characteristics, and therefore fails to guarantee the temporal and spatial consistency of the video alpha matte. Furthermore, since only foreground pixels are used as the dictionary, its representative ability is poor, leading to a poor quality of the foreground object extracted by said method.

X. Chen and Q. Chen et al. proposed a method that introduces a non-local prior to obtain the video alpha matte, which improves extraction quality by constructing a non-local structure of the video alpha matte. When implementing said method, a fixed number of sample points is selected directly for each pixel to reconstruct said pixel. However, selecting too few sample points leads to missing good sample points, while selecting too many sample points introduces noise. Furthermore, it is difficult to construct a consistent non-local structure for pixels possessing similar characteristics, which may result in temporal and spatial inconsistency of the video alpha matte; therefore the quality of a foreground object extracted by said method is poor.

When used to extract a video foreground object, the above two methods have many shortcomings that lead to a poor quality of the extracted foreground object. Therefore, it is necessary to propose a new solution to improve the quality of the extracted foreground object.

SUMMARY

Aiming at the above-mentioned disadvantages of the prior art, the present invention provides a method for video matting via sparse and low-rank representation, so as to improve the quality of the extracted foreground object.

The present invention provides a method for video matting via sparse and low-rank representation, including:

determining known pixels and unknown pixels in an input video, setting opacity α values of the known pixels, and selecting frames which can represent video characteristics in the input video as keyframes; training a dictionary according to the known pixels in the keyframes, and setting α values of sample points in the dictionary; obtaining a reconstruction coefficient of the input video corresponding to the dictionary according to the dictionary, and setting a non-local relationship matrix between each pixel in the input video according to the reconstruction coefficient; setting a Laplace matrix between multiple frames; obtaining a video alpha matte of the input video, according to the α values of the known pixels of the input video and the α values of sample points in the dictionary, the non-local relationship matrix and the Laplace matrix; and extracting a foreground object in the input video according to the video alpha matte.

In an embodiment of the present invention, the determining the known pixels and the unknown pixels in the input video specifically includes:

determining the known pixels and the unknown pixels in the input video by using a pen-based interaction marking; or determining the known pixels and the unknown pixels in the input video according to a trimap of the input video.

In an embodiment of the present invention, the setting the opacity α values of the known pixels specifically includes:

setting α values of known foreground pixels as 1, and setting α values of known background pixels as 0.

In an embodiment of the present invention, the training the dictionary according to the known pixels in the keyframes specifically includes:

training the dictionary by minimizing the following energy equation (1):

$\begin{matrix}{\underset{(D,Z)}{\arg\min}\sum\limits_{i,j}\left( \left\| \hat{X} - DZ \right\|_{F}^{2} + \left\| \hat{X}_{i} - D_{i}Z_{i} \right\|_{F}^{2} + \sum\limits_{j \neq i}\left\| D_{j}Z_{i}^{j} \right\|_{F}^{2} \right)} & (1)\end{matrix}$

wherein X̂={X̂_(f),X̂_(b)} represents the known pixels in the keyframes; X̂_(f) and X̂_(b) represent the known foreground pixels and background pixels in the keyframes, respectively; D={D_(f),D_(b)} represents the trained dictionary, and D_(f) and D_(b) represent the foreground dictionary and the background dictionary, respectively; Z_(f)={Z_(f) ^(f),Z_(f) ^(b)} represents a reconstruction coefficient of the foreground pixels X̂_(f) corresponding to the dictionary D; Z_(b)={Z_(b) ^(f),Z_(b) ^(b)} represents a reconstruction coefficient of the background pixels X̂_(b) corresponding to the dictionary D; and {Z_(i) ^(j)|i,j=f,b} represents a reconstruction coefficient of a known point X̂_(i) corresponding to the dictionary D_(j).

In an embodiment of the present invention, the obtaining the reconstruction coefficient of the input video corresponding to the dictionary according to the dictionary specifically includes:

obtaining the reconstruction coefficient of the input video corresponding to the dictionary by minimizing the following energy equation (2):

$\begin{matrix}{\min\sum\limits_{i}^{n}\left( \left\| X_{i} - DW_{i} \right\|_{0} + \left\| W_{i} \right\|_{0} \right) + \left\| W \right\|_{*}\quad \mathrm{s.t.}\ \forall p,q,\ (w_{i})_{p,q} \in W_{i},\ (w_{i})_{p,q} \geq 0} & (2)\end{matrix}$

wherein X={X₁, . . . , X_(n)}, n represents the number of frames in the input video, X_(i) represents the RGBXY characteristics of the i^(th) frame, ∥•∥_(*) represents the nuclear norm, which is the sum of the singular values of the matrix and is used for constraining the reconstruction coefficient to be low-rank, ∥•∥₀ represents the zero norm, which is the number of nonzero elements and is used for constraining the reconstruction coefficient to be sparse,

$W = \left\{ W_{1},\ldots,W_{n} \right\} = \begin{bmatrix} (w_{1})_{1,1} & \cdots & (w_{i})_{1,p} & \cdots & (w_{n})_{1,m} \\ \vdots & \ddots & \vdots & & \vdots \\ (w_{1})_{q,1} & \cdots & (w_{i})_{q,p} & \cdots & (w_{n})_{q,m} \\ \vdots & & \vdots & \ddots & \vdots \\ (w_{1})_{t,1} & \cdots & (w_{i})_{t,p} & \cdots & (w_{n})_{t,m} \end{bmatrix},$

m represents the number of pixels in each frame, t represents the number of sample points in the dictionary D, and (w_(i))_(q,p) represents the reconstruction coefficient of the p^(th) pixel in the i^(th) frame corresponding to the q^(th) sample point in the dictionary.

In an embodiment of the present invention, the setting the non-local relationship matrix between each pixel in the input video according to the reconstruction coefficient specifically includes:

setting the non-local relationship matrix according to formula (3):

$\begin{matrix}{\min {\sum\limits_{i}^{n}{\sum\limits_{j}^{m}\left( {\alpha_{ij} - {\alpha_{D}w_{ij}}} \right)^{2}}}} & (3)\end{matrix}$

wherein α_(ij) represents the α value of the j^(th) pixel in the i^(th) frame, m represents the number of pixels in each frame, α_(D)={α_(f),α_(b)} represents the α values of all sample points in the dictionary D, α_(f)=1 represents the α values of sample points in the foreground dictionary, α_(b)=0 represents the α values of sample points in the background dictionary, and w_(ij)=[(w_(i))_(1,j), . . . , (w_(i))_(t,j)] represents the reconstruction coefficient of the j^(th) pixel in the i^(th) frame corresponding to the dictionary D.

In an embodiment of the present invention, the setting the Laplace matrix between multiple frames specifically includes:

setting a Laplace matrix between multiple frames according to formula (4):

$\begin{matrix}{W_{ij}^{mlap} = \delta\sum\limits_{k}^{(i,j) \in c_{k}}\frac{1 + \left( C_{i} - \mu_{k} \right)\left( \Sigma_{k} + \frac{\epsilon}{d \times m^{2}}I \right)^{-1}\left( C_{j} - \mu_{k} \right)}{d \times m^{2}}} & (4)\end{matrix}$

wherein W_(ij) ^(mlap) represents the Laplace matrix, δ controls the intensity of local smoothness, k represents the index of a window in one frame, c_(k) represents the k^(th) window, C_(i) represents the color value of the i^(th) pixel, μ_(k) and Σ_(k) represent the mean and variance of the color in the window, respectively, ε represents a regularization coefficient, d×m² is the size of the window, meaning that d neighboring frames are selected and the pixels in an m×m window of each frame are taken as neighbors, and I represents an identity matrix.

In an embodiment of the present invention, the regularization coefficient ε is set to 10⁻⁵, m is set to 3, and d is set to 2.

In an embodiment of the present invention, the obtaining the video alpha matte of the input video, according to the α values of the known pixels of the input video and the α values of sample points in the dictionary, the non-local relationship matrix and the Laplace matrix, specifically includes:

obtaining the α values of the unknown pixels of the input video according to formula (5):

$\begin{matrix}{E = \lambda\sum\limits_{s \in S}\left( \alpha_{s} - g_{s} \right)^{2} + \sum\limits_{i = 1}^{n}\sum\limits_{j = 1}^{m}\left( \alpha_{ij} - \alpha_{D}w_{ij} \right)^{2} + \sum\limits_{i = 1}^{n}\sum\limits_{j = 1}^{m}\left( \sum\limits_{k \in N_{j}}W_{jk}^{mlap}\left( \alpha_{ij} - \alpha_{k} \right) \right)^{2}} & (5)\end{matrix}$

wherein S represents the set constructed from the α values of the known pixels of the input video and the α values of the sample points in the dictionary, N_(j) is the set of adjacent points of pixel j in the d×m² window, g_(s)=1 indicates that pixel s in set S is a foreground pixel, and g_(s)=0 indicates that pixel s in set S is a background pixel; and

obtaining the video alpha matte of the input video according to the α values of the known pixels and the α values of the unknown pixels of the input video.

The method for video matting via sparse and low-rank representation provided by the present embodiment trains a dictionary with strong representative ability according to the known foreground pixels and background pixels in the selected keyframes, then obtains a reconstruction coefficient which satisfies the low-rank, sparse and non-negative constraints according to the dictionary, sets the non-local relationship matrix between each pixel in the input video according to the reconstruction coefficient, and meanwhile sets the Laplace matrix between multiple frames, thereby guaranteeing the temporal and spatial consistency and the local smoothness of the obtained video alpha matte of the input video. Furthermore, the quality of the foreground object extracted according to the video alpha matte is improved effectively.

BRIEF DESCRIPTION OF DRAWINGS

FIG. 1 is a flow chart of a method for video matting via sparse and low-rank representation according to a first embodiment of the present invention;

FIG. 2 is a construction diagram of a Laplace matrix between multiple frames in the present invention; and

FIG. 3 is a flow chart of a method for video matting via sparse and low-rank representation according to a second embodiment of the present invention.

DESCRIPTION OF EMBODIMENTS

In order to illustrate the objects, technical solutions and advantages of the present invention more clearly, embodiments of the present invention are described in further detail with reference to the accompanying drawings. Obviously, the embodiments described are only some exemplary embodiments of the present invention, not all embodiments. Other embodiments derived by those skilled in the art on the basis of the embodiments herein without any creative effort fall within the protection scope of the present invention.

FIG. 1 is a flow chart of a method for video matting via sparse and low-rank representation according to a first embodiment of the present invention. The method can be executed by a computer or another processing device. As shown in FIG. 1, the method provided by the present embodiment includes:

S101, Determining known pixels and unknown pixels in an input video, setting opacity α values of the known pixels, and selecting frames which can represent video characteristics in the input video as keyframes.

Particularly, the known pixels include foreground pixels and background pixels, wherein the foreground pixels are pixels in an area whose image content needs to be extracted, and the background pixels are pixels in an area whose image content does not need to be extracted. The known pixels are pixels which can be clearly identified as foreground pixels or background pixels according to the input video, and the unknown pixels are pixels in an area where the foreground image and the background image overlap and are indistinguishable.

When determining the known pixels and the unknown pixels, pen-based interaction marking can be applied to the input video. For example, a pen is used to mark the foreground pixels and the background pixels in a video image, wherein pixels covered by a white pen are known foreground pixels, pixels covered by a black pen are known background pixels, and pixels not marked by a pen are the unknown pixels.

Alternatively, it is also possible to determine the known pixels and the unknown pixels in the video according to a trimap of the input video. Specifically, a black-white-gray trimap of the same size as the input video can be provided, wherein pixels corresponding to a white area are the known foreground pixels, pixels corresponding to a black area are the known background pixels, and pixels corresponding to a gray area are the unknown pixels.

It should be noted that, when determining the known pixels and the unknown pixels in the input video, the above methods can be applied to the whole input video or to only a part of it, according to the actual situation; all pixels in video images for which no known pixels are determined are treated as unknown pixels.

After determining the known pixels in the input video, the opacity α values of the known pixels can be set, wherein the α values of the foreground pixels which need to be extracted are set to larger values and the α values of the background pixels which do not need to be extracted are set to smaller values. Preferably, in the present embodiment, the α values of the known foreground pixels are set to the maximum value 1 and the α values of the known background pixels are set to the minimum value 0.
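As an illustration only, the trimap-based initialization of step S101 might be sketched as follows. The function name `alpha_from_trimap`, the gray levels 255 and 0 for the white and black areas, and the use of NaN to mark unknown pixels are assumptions made for this sketch and are not specified by the present embodiment.

```python
import numpy as np

def alpha_from_trimap(trimap, fg_value=255, bg_value=0):
    """Return (alpha, known) for one black-white-gray trimap frame.

    fg_value and bg_value are assumed gray levels of the white and black areas.
    """
    trimap = np.asarray(trimap)
    fg = trimap == fg_value          # white area: known foreground
    bg = trimap == bg_value          # black area: known background
    known = fg | bg                  # everything else (gray) is unknown
    alpha = np.full(trimap.shape, np.nan, dtype=float)  # unknown pixels stay NaN
    alpha[fg] = 1.0                  # known foreground gets the maximum value 1
    alpha[bg] = 0.0                  # known background gets the minimum value 0
    return alpha, known

trimap = np.array([[255, 128, 0],
                   [255, 128, 0]], dtype=np.uint8)
alpha, known = alpha_from_trimap(trimap)
```

Here the middle (gray) column remains unknown and would be solved for later in step S105.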

Moreover, since the data amount of the entire input video is large, in the present embodiment frames which can represent the video characteristics of the input video are selected as the keyframes to train the dictionary, so as to reduce the amount of calculation. When selecting the keyframes, it is possible to select one frame out of every several frames and take the selected frames as the keyframes, or to select more frames in video segments with large variation and fewer frames in video segments with small variation. The selection can be made arbitrarily according to specific circumstances, as long as the video characteristics can be represented.
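One possible way to realize both selection strategies is sketched below. The function `select_keyframes`, its `step` and `diff_threshold` parameters, and the use of the mean absolute frame difference as a measure of variation are hypothetical choices for illustration; the present embodiment leaves the selection rule open.

```python
import numpy as np

def select_keyframes(frames, step=5, diff_threshold=None):
    """Pick keyframe indices from a (n, H, W, C) array of frames.

    A frame becomes a keyframe when `step` frames have passed since the last
    keyframe, or (optionally) when it differs strongly from the last keyframe.
    """
    frames = np.asarray(frames, dtype=float)
    keyframes = [0]
    for i in range(1, len(frames)):
        last = keyframes[-1]
        periodic = (i - last) >= step
        changed = (diff_threshold is not None and
                   np.mean(np.abs(frames[i] - frames[last])) > diff_threshold)
        if periodic or changed:
            keyframes.append(i)
    return keyframes

frames = np.zeros((12, 2, 2, 3))
frames[7:] = 100.0                      # abrupt change starting at frame 7
uniform = select_keyframes(frames, step=5)
adaptive = select_keyframes(frames, step=5, diff_threshold=10.0)
```

With uniform sampling the toy video yields frames 0, 5 and 10, while the adaptive variant additionally picks up the frame at which the content changes.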

S102, Training a dictionary according to the known pixels in the keyframes, and setting α values of sample points in the dictionary.

After obtaining the keyframes, the dictionary can be trained directly according to the known pixels in the keyframes. The dictionary includes a foreground dictionary and a background dictionary, whose characteristic space is a five-dimensional space of RGBXY characteristic values, wherein RGB is the RGB color value of a pixel and XY is the coordinate position of the pixel in the image. The training process of the dictionary can be transformed into minimizing the following energy equation:

$\begin{matrix}{\underset{(D,Z)}{\arg\min}\sum\limits_{i,j}\left( \left\| \hat{X} - DZ \right\|_{F}^{2} + \left\| \hat{X}_{i} - D_{i}Z_{i} \right\|_{F}^{2} + \sum\limits_{j \neq i}\left\| D_{j}Z_{i}^{j} \right\|_{F}^{2} \right)} & (1)\end{matrix}$

Wherein X̂={X̂_(f),X̂_(b)} represents the known pixels in the keyframes; X̂_(f) and X̂_(b) represent the known foreground pixels and background pixels in the keyframes, respectively; D={D_(f),D_(b)} represents the trained dictionary, and D_(f) and D_(b) represent the foreground dictionary and the background dictionary, respectively; Z_(f)={Z_(f) ^(f),Z_(f) ^(b)} represents a reconstruction coefficient of the foreground pixels X̂_(f) corresponding to the dictionary D; Z_(b)={Z_(b) ^(f),Z_(b) ^(b)} represents a reconstruction coefficient of the background pixels X̂_(b) corresponding to the dictionary D; and {Z_(i) ^(j)|i,j=f,b} represents a reconstruction coefficient of a known point X̂_(i) corresponding to the dictionary D_(j).

In the above formula (1), the first term ∥X̂−DZ∥_(F) ² represents that the dictionary can reconstruct all known pixels, so as to guarantee a strong representative ability of the dictionary; the second term ∥X̂_(i)−D_(i)Z_(i)∥_(F) ² represents that the dictionary D_(i) can reconstruct the known pixels X̂_(i), namely the foreground pixels can be reconstructed by the foreground dictionary and the background pixels can be reconstructed by the background dictionary; the third term

$\sum\limits_{j \neq i}\left\| D_{j}Z_{i}^{j} \right\|_{F}^{2}$

constrains the reconstruction coefficient Z_(i) ^(j) of the known pixels X̂_(i) corresponding to the dictionary D_(j) (j≠i) to be close to 0, namely the foreground points respond to the foreground dictionary but hardly respond to the background dictionary, and the background points respond to the background dictionary but hardly respond to the foreground dictionary. That is to say, the foreground points are reconstructed by the foreground dictionary but cannot be reconstructed by the background dictionary, and the background points are reconstructed by the background dictionary but cannot be reconstructed by the foreground dictionary.
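To make the roles of the three terms concrete, the following sketch evaluates the energy of formula (1) for a given dictionary and given coefficients. It only evaluates the energy and does not perform the minimization. The function name `dictionary_energy`, the nested-dict layout of `Z`, and the reading of the second term as ∥X̂_(i)−D_(i)Z_(i) ^(i)∥_(F) ² (with the outer sum taken once per class i) are assumptions of this sketch.

```python
import numpy as np

def dictionary_energy(X, D, Z):
    """Evaluate formula (1) under the assumptions stated above.

    X[i], D[i]: arrays with one 5-D RGBXY feature per column, i in {'f', 'b'}.
    Z[i][j]: coefficients of the class-i pixels on sub-dictionary D[j].
    """
    D_all = np.hstack([D["f"], D["b"]])
    energy = 0.0
    for i in ("f", "b"):
        Z_i = np.vstack([Z[i]["f"], Z[i]["b"]])
        # Term 1: the whole dictionary reconstructs the known pixels.
        energy += np.linalg.norm(X[i] - D_all @ Z_i, "fro") ** 2
        # Term 2: the class's own sub-dictionary reconstructs its pixels.
        energy += np.linalg.norm(X[i] - D[i] @ Z[i][i], "fro") ** 2
        # Term 3: the response on the other sub-dictionary is penalized.
        for j in ("f", "b"):
            if j != i:
                energy += np.linalg.norm(D[j] @ Z[i][j], "fro") ** 2
    return float(energy)

e1, e2 = np.eye(5)[:, [0]], np.eye(5)[:, [1]]   # toy 5-D atoms
X = {"f": e1, "b": e2}
D = {"f": e1, "b": e2}
one, zero = np.ones((1, 1)), np.zeros((1, 1))
good = {"f": {"f": one, "b": zero}, "b": {"f": zero, "b": one}}
bad = {"f": {"f": zero, "b": one}, "b": {"f": zero, "b": one}}
energy_good = dictionary_energy(X, D, good)
energy_bad = dictionary_energy(X, D, bad)
```

In this toy case the energy is zero when each class is coded only by its own sub-dictionary, and it grows when the foreground pixel is coded by the background atom.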

Regarding the α value of each sample point in the dictionary, the α values of sample points in the foreground dictionary may be set to 1, and the α values of sample points in the background dictionary may be set to 0.

S103, Obtaining a reconstruction coefficient of the input video corresponding to the dictionary according to the dictionary and setting a non-local relationship matrix between each pixel in the input video according to the reconstruction coefficient.

Since the pixels of the same object in different frames come from an identical characteristic subspace, each pixel can be expressed as a linear combination of elements of that subspace. Therefore the entire video can be reconstructed by the dictionary with a low-rank and sparse representation matrix (namely the reconstruction coefficient matrix below). For the entire video, each pixel has one reconstruction coefficient corresponding to the dictionary D. The reconstruction coefficient of the entire input video can be expressed as:

$W = \left\{ W_{1},\ldots,W_{n} \right\} = \begin{bmatrix} (w_{1})_{1,1} & \cdots & (w_{i})_{1,p} & \cdots & (w_{n})_{1,m} \\ \vdots & \ddots & \vdots & & \vdots \\ (w_{1})_{q,1} & \cdots & (w_{i})_{q,p} & \cdots & (w_{n})_{q,m} \\ \vdots & & \vdots & \ddots & \vdots \\ (w_{1})_{t,1} & \cdots & (w_{i})_{t,p} & \cdots & (w_{n})_{t,m} \end{bmatrix}$

The solving process of the reconstruction coefficient of the entire video can be formulated as minimizing the following energy equation:

$\begin{matrix}{\min\sum\limits_{i}^{n}\left( \left\| X_{i} - DW_{i} \right\|_{0} + \left\| W_{i} \right\|_{0} \right) + \left\| W \right\|_{*}\quad \mathrm{s.t.}\ \forall p,q,\ (w_{i})_{p,q} \in W_{i},\ (w_{i})_{p,q} \geq 0} & (2)\end{matrix}$

Wherein X={X₁, . . . , X_(n)}, n represents the number of frames in the input video, X_(i) represents the RGBXY characteristics of the i^(th) frame, ∥•∥_(*) represents the nuclear norm, which is the sum of the singular values of the matrix and is used for constraining the reconstruction coefficient to be low-rank, ∥•∥₀ represents the zero norm, which is the number of nonzero elements and is used for constraining the reconstruction coefficient to be sparse, m represents the number of pixels in each frame, t represents the number of sample points in the dictionary D, and (w_(i))_(q,p) represents the reconstruction coefficient of the p^(th) pixel in the i^(th) frame corresponding to the q^(th) sample point in the dictionary.

In the above formula (2), the sparsity constraint guarantees that each pixel in the video can be reconstructed by a few elements in the dictionary, and the low-rank constraint guarantees the temporal and spatial consistency of the video alpha matte. Specifically, the low-rank constraint guarantees that pixels possessing similar characteristics in one frame are reconstructed by the same elements in the dictionary, which guarantees the consistency of the video alpha matte in the spatial domain; it also guarantees that pixels possessing similar characteristics in consecutive frames are reconstructed by the same elements in the dictionary, which extends this consistency to the temporal domain, so that the temporal and spatial consistency of the video alpha matte is guaranteed. Preferably, low-rank requires that the rank of W is far less than the number of its rows and the number of its columns, and sparse requires that more than 50% of the elements of W are 0.
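The quantities appearing in formula (2) and the preferred low-rank and sparsity conditions can be checked numerically for a candidate W. The sketch below is a diagnostic only and does not solve formula (2); the function name `low_rank_sparse_report` and the tolerance `tol` used to decide what counts as zero are assumptions.

```python
import numpy as np

def low_rank_sparse_report(W, tol=1e-8):
    """Report nuclear norm, rank, zero norm and sparsity of a t x (n*m) matrix W."""
    W = np.asarray(W, dtype=float)
    singular_values = np.linalg.svd(W, compute_uv=False)
    return {
        "nuclear_norm": float(singular_values.sum()),        # sum of singular values
        "rank": int((singular_values > tol).sum()),
        "zero_norm": int(np.count_nonzero(np.abs(W) > tol)),  # number of nonzeros
        "zero_fraction": float(np.mean(np.abs(W) <= tol)),    # preferred: above 0.5
        "non_negative": bool((W >= -tol).all()),
    }

# Toy coefficients: 4 sample points (rows), 6 pixels (columns).
W = np.zeros((4, 6))
W[0, :3] = 1.0     # first three pixels use sample point 0
W[2, 3:] = 0.5     # last three pixels use sample point 2
report = low_rank_sparse_report(W)
```

In the toy matrix, pixels with similar characteristics share the same sample point, which is what keeps the rank low (2) while 75% of the entries are zero.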

After solving the reconstruction coefficient of the input video, a non-local relationship matrix can be set between each pixel in the input video according to the reconstruction coefficient:

$\begin{matrix}{\min {\sum\limits_{i}^{n}{\sum\limits_{j}^{m}\left( {\alpha_{ij} - {\alpha_{D}w_{ij}}} \right)^{2}}}} & (3)\end{matrix}$

Wherein α_(ij) represents the α value of the j^(th) pixel in the i^(th) frame, m represents the number of pixels in each frame, α_(D)={α_(f),α_(b)} represents the α values of all sample points in the dictionary D, α_(f)=1 represents the α values of sample points in the foreground dictionary, α_(b)=0 represents the α values of sample points in the background dictionary, and w_(ij)=[(w_(i))_(1,j), . . . , (w_(i))_(t,j)] represents the reconstruction coefficient of the j^(th) pixel in the i^(th) frame corresponding to the dictionary D.

The reconstruction coefficient solved above satisfies the low-rank and sparse constraints; therefore the non-local relationship matrix constructed according to the reconstruction coefficient may guarantee the temporal and spatial consistency of the video alpha matte in the non-local relationship.
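A minimal sketch of the non-local term of formula (3) is given below, assuming that the first rows of W correspond to the foreground dictionary and the remaining rows to the background dictionary. The names `nonlocal_alpha`, `nonlocal_energy` and `num_fg_atoms` are hypothetical.

```python
import numpy as np

def nonlocal_alpha(W, num_fg_atoms):
    """Return alpha_D @ W, the alpha value each pixel inherits from the dictionary."""
    W = np.asarray(W, dtype=float)
    t = W.shape[0]
    # alpha_D is 1 for foreground sample points and 0 for background ones.
    alpha_D = np.concatenate([np.ones(num_fg_atoms), np.zeros(t - num_fg_atoms)])
    return alpha_D @ W

def nonlocal_energy(alpha, W, num_fg_atoms):
    """Value of formula (3) for a candidate alpha vector (one entry per pixel)."""
    residual = np.asarray(alpha, dtype=float) - nonlocal_alpha(W, num_fg_atoms)
    return float(np.sum(residual ** 2))

# 3 sample points (2 foreground, 1 background) and 3 pixels.
W = np.array([[1.0, 0.0, 0.25],
              [0.0, 0.0, 0.25],
              [0.0, 1.0, 0.50]])
alpha_hat = nonlocal_alpha(W, num_fg_atoms=2)
energy = nonlocal_energy([1.0, 0.0, 0.5], W, num_fg_atoms=2)
```

The third pixel, coded half by foreground and half by background sample points, receives an intermediate α value of 0.5.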

S104, Setting a Laplace matrix between multiple frames.

When setting the non-local relationship, a Laplace matrix between frames may also be set at the same time so as to guarantee the temporal and spatial consistency of the video alpha matte in the local relationship. Particularly, the Laplace matrix W_(ij) ^(mlap) can be set between multiple frames according to formula (4):

$\begin{matrix}{W_{ij}^{mlap} = \delta\sum\limits_{k}^{(i,j) \in c_{k}}\frac{1 + \left( C_{i} - \mu_{k} \right)\left( \Sigma_{k} + \frac{\epsilon}{d \times m^{2}}I \right)^{-1}\left( C_{j} - \mu_{k} \right)}{d \times m^{2}}} & (4)\end{matrix}$

Wherein δ controls the intensity of local smoothness, k represents the index of a window in one frame, c_(k) represents the k^(th) window, C_(i) represents the color value of the i^(th) pixel, μ_(k) and Σ_(k) represent the mean and variance of the color in the window, respectively, ε represents a regularization coefficient, d×m² is the size of the window, meaning that d neighboring frames are selected and the pixels in an m×m window of each frame are taken as neighbors, and I represents an identity matrix.

Extending the above Laplace matrix from a single frame to multiple frames means that, besides the pixels in neighboring windows in the present frame, the pixels in neighboring windows in neighboring video frames are also considered. These pixels together constitute the neighbors used to construct the color line model of a point, so that not only the local smoothness of the video alpha matte but also the temporal and spatial consistency of the video alpha matte across neighboring frames is enhanced.

Preferably, in the above equation (4), the regularization coefficient ε is set to 10⁻⁵, m is set to 3, and d is set to 2. FIG. 2 is a construction diagram of the Laplace matrix between multiple frames in the present invention. As shown in FIG. 2, the figure illustrates a method of constructing a two-frame Laplace matrix. For a pixel j in the present frame, besides the pixels in the 3×3 window in the present frame, the pixels in the 3×3 window in the neighboring frame are also considered; the two groups of pixels constitute the neighbors of the pixel j, from which the Laplace matrix is constructed.
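The contribution of a single d×m² window to formula (4) might be computed as sketched below. This covers one window only; the full entry W_(ij) ^(mlap) would sum such contributions over every window containing both pixels. The function name `window_affinity`, the normalization of the color covariance by the window size, and the reading of the regularizer as ε/(d×m²) are assumptions of this sketch.

```python
import numpy as np

def window_affinity(colors, delta=1.0, eps=1e-5):
    """Affinity between all pixel pairs of one window.

    colors: (N, 3) array of RGB values, N = d * m * m (e.g. 2 * 3 * 3 = 18).
    """
    colors = np.asarray(colors, dtype=float)
    n, channels = colors.shape
    mu = colors.mean(axis=0)                 # window mean color
    centered = colors - mu
    cov = centered.T @ centered / n          # window color covariance
    inv = np.linalg.inv(cov + (eps / n) * np.eye(channels))
    return delta * (1.0 + centered @ inv @ centered.T) / n

rng = np.random.default_rng(0)
colors = rng.random((18, 3))                 # d = 2 frames, 3 x 3 pixels each
A = window_affinity(colors)
```

Because the centered colors sum to zero, each row of the single-window affinity sums to δ, which is a convenient sanity check when assembling the full matrix.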

It should be noted that there is no strict timing relationship between step S103 and step S104; step S104 may be performed before step S103 or simultaneously with step S103.

S105, Obtaining a video alpha matte of the input video, according to the α values of the known pixels of the input video and the α values of sample points in the dictionary, the non-local relationship matrix and the Laplace matrix.

According to the α values of all known pixels determined in S101, the α value of each sample point in the trained dictionary determined in S102, the non-local relationship matrix constructed in S103 and the Laplace matrix constructed in S104, an energy equation of all pixels in multiple consecutive frames can be constructed so as to obtain the probability (i.e., the α value) of each pixel belonging to the foreground, whereby the video alpha matte is obtained.

Particularly, the energy equation may be constructed according to formula (5):

$\begin{matrix}{E = \lambda\sum\limits_{s \in S}\left( \alpha_{s} - g_{s} \right)^{2} + \sum\limits_{i = 1}^{n}\sum\limits_{j = 1}^{m}\left( \alpha_{ij} - \alpha_{D}w_{ij} \right)^{2} + \sum\limits_{i = 1}^{n}\sum\limits_{j = 1}^{m}\left( \sum\limits_{k \in N_{j}}W_{jk}^{mlap}\left( \alpha_{ij} - \alpha_{k} \right) \right)^{2}} & (5)\end{matrix}$

Wherein S represents the set constructed from the α values of the known pixels of the input video and the α values of the sample points in the dictionary, N_(j) is the set of adjacent points of pixel j in the d×m² window, g_(s)=1 indicates that pixel s in set S is a foreground pixel, and g_(s)=0 indicates that pixel s in set S is a background pixel.

After the α values of the unknown pixels in the input video are obtained according to the above formula (5), they are combined with the α values of the known pixels in the input video to obtain the video alpha matte of the input video.

The above formula (5) can be solved in the following way:

The above energy equation E can be expressed in matrix form as:

E=(α−G)^(T)Λ(α−G)+α^(T) Lα  (6)

Wherein Λ is a diagonal matrix: if the pixel s belongs to the set S, Λ_(ss) is set to a large constant, such as 200; otherwise it is set to 0. G is a vector whose values are determined by the α values set in step S102: if the pixel s belongs to the known foreground pixels, G_(s) is set to 1; otherwise it is set to 0,

${L = \begin{bmatrix}L_{D} & {- W} \\{- W^{T}} & L_{u}\end{bmatrix}},$

wherein W is the reconstruction coefficient matrix of the input video corresponding to the dictionary D, L_(D)=WW^(T), and L_(u) is a block-diagonal matrix formed from the multi-frame Laplace matrices of the frames, that is, L_(u)=diag(W₁ ^(mlap); . . . ; W_(n) ^(mlap)). The matrix form (6) of the above energy equation is quadratic in α, and E can be minimized by solving the following linear equation:

(Λ+L)α=ΛG  (7)

The above equation is a sparse system of linear equations, and its globally optimal closed-form solution can be computed by a preconditioned conjugate gradient method.
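The solution of equation (7) can be sketched on a toy problem as follows. The solver `jacobi_pcg` uses a diagonal (Jacobi) preconditioner, which is an assumption since the present embodiment does not name a specific preconditioner, and the Laplacian of a chain of five pixels merely stands in for the matrix L of formula (6). A practical implementation would use sparse matrices instead of dense arrays.

```python
import numpy as np

def jacobi_pcg(A, b, tol=1e-10, max_iter=1000):
    """Conjugate gradient with a Jacobi preconditioner; A must be symmetric positive definite."""
    inv_diag = 1.0 / np.diag(A)
    x = np.zeros_like(b, dtype=float)
    r = b - A @ x
    z = inv_diag * r
    p = z.copy()
    rz = r @ z
    for _ in range(max_iter):
        if np.linalg.norm(r) <= tol * np.linalg.norm(b):
            break
        Ap = A @ p
        step = rz / (p @ Ap)
        x += step * p
        r -= step * Ap
        z = inv_diag * r
        rz_new = r @ z
        p = z + (rz_new / rz) * p
        rz = rz_new
    return x

# Toy stand-in for L: graph Laplacian of a chain of 5 pixels.
n = 5
L = np.zeros((n, n))
for i in range(n - 1):
    L[i, i] += 1.0
    L[i + 1, i + 1] += 1.0
    L[i, i + 1] -= 1.0
    L[i + 1, i] -= 1.0

known = np.array([True, False, False, False, True])   # pixels in the set S
G = np.array([1.0, 0.0, 0.0, 0.0, 0.0])               # pixel 0 foreground, pixel 4 background
Lam = np.diag(np.where(known, 200.0, 0.0))            # large constant on known pixels

alpha = jacobi_pcg(Lam + L, Lam @ G)                  # solves (Lam + L) alpha = Lam G
```

On this chain the α values fall smoothly from nearly 1 at the known foreground pixel to nearly 0 at the known background pixel, with the middle pixel at 0.5.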

S106, Extracting a foreground object in the input video according to thevideo alpha matte.

Regarding the input video X, each pixel X_(i) is a linear combination of the color F_(i) of the foreground image and the color B_(i) of the background image, that is, X_(i)=F_(i)×α_(i)+B_(i)×(1−α_(i)). Therefore the foreground object of the input video can be extracted by multiplying each pixel in the input video by the α value of that pixel in the obtained video alpha matte, which can be expressed by the formula:

C=X×α  (8)

Wherein C represents the extracted foreground object of the input video, X represents the input video, and α represents the α value of each pixel in the video alpha matte corresponding to the input video.
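The per-pixel multiplication of step S106 can be sketched with array broadcasting as follows. The array layout (frames, height, width, channels) and the clipping of α to the range [0, 1] are assumptions of this sketch.

```python
import numpy as np

def extract_foreground(video, alpha):
    """Multiply every pixel of the video by its alpha value.

    video: (n, H, W, 3) array of colors; alpha: (n, H, W) array of alpha values.
    """
    video = np.asarray(video, dtype=float)
    alpha = np.clip(np.asarray(alpha, dtype=float), 0.0, 1.0)
    return video * alpha[..., np.newaxis]   # broadcast alpha over the color channels

video = np.full((1, 1, 2, 3), 200.0)        # one frame with two gray pixels
alpha = np.array([[[1.0, 0.25]]])
foreground = extract_foreground(video, alpha)
```

The fully opaque pixel keeps its color, while the pixel with α=0.25 is attenuated to a quarter of its intensity.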

The existing image matting method based on sparse representation uses only foreground pixels to reconstruct the original image, which fails to guarantee the temporal and spatial consistency; since only foreground pixels are used as the dictionary, the representative ability is poor, so that the quality of the foreground object extracted by said method is poor. Compared with said method, in the present embodiment the low-rank constraint and the Laplace matrix between multiple frames guarantee that pixels with similar characteristics possess similar α values, and therefore the temporal and spatial consistency of the video alpha matte is guaranteed. Furthermore, the known pixels used for training the dictionary include both the background pixels and the foreground pixels, so the constructed foreground dictionary and background dictionary possess strong discriminative ability and strong representative ability, which effectively improves the quality of the extracted foreground object. In addition, in the present embodiment only the pixels in the keyframes are used to train the dictionary, so the calculation load is small.

In the existing method of obtaining the video alpha matte by introducing a non-local prior, a fixed number of sample points is selected to reconstruct the original image. It is difficult to construct a consistent non-local structure for pixels possessing similar characteristics, which may result in a temporally and spatially inconsistent video alpha matte; therefore the quality of a foreground object extracted by said method is poor. However, in the method provided by the present embodiment, the foreground dictionary and the background dictionary are first constructed according to the known pixels, and the sample points in the dictionary are then selected automatically while the reconstruction coefficient is solved under the low-rank constraint. The low-rank constraint and the Laplace matrix between multiple frames guarantee that pixels with similar characteristics possess similar α values, and therefore guarantee the temporal and spatial consistency of the video alpha matte. Furthermore, the quality of the extracted foreground object is improved effectively.

Many experiments show that the method provided by the present embodiment possesses obvious advantages when dealing with the blur left behind by a fast-moving object, the edges of semitransparent objects and different translucencies, and objects with large topology variation. The method may be widely applied to video program production and other image processing fields.

In the method for video matting via sparse and low-rank representation provided by the present embodiment, a dictionary with strong representative and discriminative ability is trained according to the known foreground pixels and background pixels in the selected keyframes; a reconstruction coefficient which satisfies the restriction of low-rank, sparse and non-negative is then obtained according to the dictionary; the non-local relationship matrix between each pixel in the input video is set according to the reconstruction coefficient; and meanwhile the Laplace matrix between multiple frames is set. This guarantees the temporal and spatial consistency and the local smoothness of the obtained video alpha matte of the input video. Furthermore, the quality of the foreground object extracted according to the video alpha matte is effectively improved.

FIG. 3 is a flow chart of a method for video matting via sparse and low-rank representation according to a second embodiment of the present invention. The present embodiment mainly illustrates the detailed steps of obtaining the reconstruction coefficient of the input video corresponding to the dictionary in the above step S103. On the basis of the above embodiment, as shown in FIG. 3, in the present embodiment, obtaining the reconstruction coefficient of the input video corresponding to the dictionary according to the dictionary in step S103 specifically includes:

S201, Transforming formula (2) to formula (8) equivalently:

$\begin{matrix}{\min\sum\limits_{i}^{n}\left( \left\| W_{i} \right\|_{1} + \lambda\left\| E_{i} \right\|_{1} \right) + \gamma\left\| W \right\|_{*}\quad \text{s.t.}\; X_{i} = D_{i}S_{i} + E_{i};\; W_{i} = J_{i};\; W_{i} = S_{i};\; W_{i} = T_{i},\; T_{i} \geq 0.} & (8)\end{matrix}$

Wherein, X_(i) represents the RGBXY characteristics of the i^(th) frame, λ and γ represent equilibrium coefficients, and S₁, …, S_(n), J₁, …, J_(n), T₁, …, T_(n) are auxiliary variables.

S202, Transforming formula (8) to formula (9) equivalently:

$\begin{matrix}{\min \left( {{\gamma {W}_{*}} + {\sum\limits_{i}^{n}\left( {{J_{i}}_{1} + {\lambda {E_{i}}_{1}}} \right)} + {\sum\limits_{i}^{n}\left( {{\langle{A_{i},{W_{i} - J_{i}}}\rangle} + {\langle{Y_{i},{X_{i} - {D_{i}S_{i}} - E_{i}}}\rangle} + {\langle{V_{i},{W_{i} - S_{i}}}\rangle} + {\langle{U_{i},{W_{i} - T_{i}}}\rangle} + {\frac{\mu}{2}{{X_{i} - {D_{i}S_{i}} - E_{i}}}_{F}^{2}} + {\frac{\mu}{2}{{W_{i} - J_{i}}}_{F}^{2}} + {\frac{\mu}{2}{{W_{i} - S_{i}}}_{F}^{2}} + {\frac{\mu}{2}{{W_{i} - T_{i}}}_{F}^{2}}} \right)}} \right)} & (9)\end{matrix}$

Wherein, E_(i) is the reconstruction error for the i^(th) frame, and A₁, …, A_(n), Y₁, …, Y_(n), V₁, …, V_(n), U₁, …, U_(n) are Lagrangian multipliers.

S203, Solving formula (9) by using an alternating direction method (ADM).

The ADM algorithm here is the inexact augmented Lagrange multiplier method (inexact ALM), which mainly uses an iterative solution method. Its input variables include the video X with n frames, the dictionary D, and the equilibrium coefficients λ and γ. The specific steps are as follows:

First, initializing A=U=V=Y=0, S=T=J=0, μ=10⁻⁶, and then starting the iterative process:

1, Fixing other variables, and updating J_(i). The specific formula is:

$J_{i} = \underset{J_{i}}{\arg\min}\;\frac{1}{\mu}\left\| J_{i} \right\|_{1} + \frac{1}{2}\left\| J_{i} - \left( W_{i} + \frac{A_{i}}{\mu} \right) \right\|_{F}^{2}.$

2, Fixing other variables, and updating S_(i). The specific formula is:

$S_{i} = {\left( {{D^{T}D} + I} \right)^{- 1}{\left( {{D^{T}\left( {X_{i} - E_{i}} \right)} + W_{i} + \frac{\left( {{D^{T}Y_{i}} + V_{i}} \right)}{\mu}} \right).}}$

3, Fixing other variables, and updating T_(i). The specific formula is:

${T_{i} = {W_{i} + \frac{U_{i}}{\mu}}},{T_{i} = {{\max \left( {T_{i},0} \right)}.}}$

4, Fixing other variables, and updating W. The specific formula is:

$W = \underset{W}{\arg\min}\;\frac{\gamma}{2\mu}\left\| W \right\|_{*} + \frac{1}{2}\left\| W - M \right\|_{F}^{2},$ wherein $M = \left\lbrack F_{1}, F_{2}, \ldots, F_{n} \right\rbrack$ and $F_{i} = \frac{1}{3}\left( J_{i} + S_{i} + T_{i} - \frac{A_{i} + V_{i} + U_{i}}{\mu} \right).$

5, Fixing other variables, and updating E_(i). The specific formula is:

$E_{i} = \underset{E_{i}}{\arg\min}\;\frac{\lambda}{\mu}\left\| E_{i} \right\|_{1} + \frac{1}{2}\left\| E_{i} - \left( X_{i} - DS_{i} + \frac{Y_{i}}{\mu} \right) \right\|_{F}^{2}.$

6, Updating each Lagrangian multiplier A_(i), Y_(i), V_(i), U_(i). The specific formulas are:

A_(i)=A_(i)+μ(W_(i)−J_(i)), Y_(i)=Y_(i)+μ(X_(i)−DS_(i)−E_(i)),

V_(i)=V_(i)+μ(W_(i)−S_(i)), U_(i)=U_(i)+μ(W_(i)−T_(i)).

7, Updating μ. The specific formula is:

μ=min(1.1μ,10¹⁰) (ρ=1.9).

8, Checking whether the condition of convergence is achieved, i.e., X_(i)−DS_(i)−E_(i)→0, W_(i)−J_(i)→0, W_(i)−S_(i)→0 and W_(i)−T_(i)→0. If the iteration has not converged, it continues until it converges or reaches the maximum number of iterations.
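The iteration of steps 1 to 8 may be sketched in NumPy as follows. This is only an illustrative sketch under stated assumptions, not the implementation of the embodiment: the frames X_(1), …, X_(n) are concatenated column-wise into a single matrix X, so that the per-frame updates are applied to all frames at once; a single dictionary D is shared by all frames; the J_(i) and E_(i) subproblems are solved by elementwise soft-thresholding and the W subproblem by singular value thresholding, which are the standard closed-form solutions for problems of these forms; the growth factor 1.1 and the cap 10¹⁰ are taken from step 7; and the function names, the tolerance `tol`, the iteration cap `max_iter` and the default values of `lam` and `gamma` are hypothetical choices that the embodiment does not specify.

```python
import numpy as np


def soft_threshold(M, tau):
    """Minimizer of tau*||Z||_1 + 0.5*||Z - M||_F^2 (used in steps 1 and 5)."""
    return np.sign(M) * np.maximum(np.abs(M) - tau, 0.0)


def singular_value_threshold(M, tau):
    """Minimizer of tau*||Z||_* + 0.5*||Z - M||_F^2 (used in step 4)."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return (U * np.maximum(s - tau, 0.0)) @ Vt


def adm_reconstruction(X, D, lam=1.0, gamma=1.0, mu=1e-6, mu_max=1e10,
                       growth=1.1, tol=1e-6, max_iter=500):
    """X: (features, pixels of all frames), D: (features, t) dictionary.

    Returns (W, E, number of iterations performed).
    """
    X = np.asarray(X, dtype=np.float64)
    D = np.asarray(D, dtype=np.float64)
    t = D.shape[1]
    # Initialization: all auxiliary variables and multipliers are zero.
    W, J, S, T = (np.zeros((t, X.shape[1])) for _ in range(4))
    A, V, U = (np.zeros_like(W) for _ in range(3))
    E, Y = np.zeros_like(X), np.zeros_like(X)
    G = np.linalg.inv(D.T @ D + np.eye(t))  # constant factor of the S update

    for iteration in range(1, max_iter + 1):
        J = soft_threshold(W + A / mu, 1.0 / mu)                  # step 1
        S = G @ (D.T @ (X - E) + W + (D.T @ Y + V) / mu)          # step 2
        T = np.maximum(W + U / mu, 0.0)                           # step 3
        M = (J + S + T - (A + V + U) / mu) / 3.0                  # step 4
        W = singular_value_threshold(M, gamma / (2.0 * mu))
        E = soft_threshold(X - D @ S + Y / mu, lam / mu)          # step 5
        R = X - D @ S - E                                         # reconstruction residual
        A = A + mu * (W - J)                                      # step 6
        Y = Y + mu * R
        V = V + mu * (W - S)
        U = U + mu * (W - T)
        mu = min(growth * mu, mu_max)                             # step 7
        residual = max(np.abs(R).max(), np.abs(W - J).max(),
                       np.abs(W - S).max(), np.abs(W - T).max())
        if residual < tol:                                        # step 8
            break
    return W, E, iteration
```

Under these assumptions the returned W stacks W_(1), …, W_(n) column-wise, so the coefficients of the individual frames can be recovered with `np.hsplit(W, n)` when every frame has the same number of pixels.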

Finally, it should be noted that the above embodiments are merely provided for describing the technical solutions of the present invention, but are not intended to limit the present invention. It should be understood by persons skilled in the art that although the present invention has been described in detail with reference to the foregoing embodiments, modifications can be made to the technical solutions described in the foregoing embodiments, or equivalent replacements can be made to some or all of the technical features in the technical solutions; however, such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the scope of the embodiments of the present invention.

What is claimed is:
 1. A method for video matting via sparse and low-rank representation, comprising: determining known pixels and unknown pixels in an input video, setting opacity α values of the known pixels, and selecting frames which can represent video characteristics in the input video as keyframes; training a dictionary according to the known pixels in the keyframes, and setting α values of sample points in the dictionary; obtaining a reconstruction coefficient of the input video corresponding to the dictionary according to the dictionary, and setting a non-local relationship matrix between each pixel in the input video according to the reconstruction coefficient; setting a Laplace matrix between multiple frames; obtaining a video alpha matte of the input video, according to the α values of the known pixels of the input video and the α values of the sample points in the dictionary, the non-local relationship matrix and the Laplace matrix; and extracting a foreground object in the input video according to the video alpha matte.
 2. The method according to claim 1, wherein, the determining the known pixels and the unknown pixels in the input video, specifically comprises: determining the known pixels and the unknown pixels in the input video by using a pen-based interaction marking; or determining the known pixels and the unknown pixels in the input video according to a trimap of the input video.
 3. The method according to claim 1, wherein, the setting the opacity α values of the known pixels, specifically comprises: setting α values of known foreground pixels as 1, and setting α values of known background pixels as 0.
 4. The method according to claim 1, wherein, the training the dictionary according to the known pixels in the keyframes, specifically comprises: training the dictionary by minimizing following energy equation (1): $\begin{matrix}{\underset{(D,Z)}{\arg\min}\sum\limits_{i,j}\left( \left\| \hat{X} - DZ \right\|_{F}^{2} + \left\| \hat{X}_{i} - D_{i}Z_{i} \right\|_{F}^{2} + \sum\limits_{j \neq i}\left\| D_{j}Z_{i}^{j} \right\|_{F}^{2} \right)} & (1)\end{matrix}$ wherein, X̂={X̂_(f),X̂_(b)} represents the known pixels in the keyframes; X̂_(f) and X̂_(b) represent the known foreground pixels and background pixels in the keyframes respectively; D={D_(f),D_(b)} represents the trained dictionary, D_(f) and D_(b) represent the foreground dictionary and the background dictionary respectively; Z_(f)={Z_(f)^(f),Z_(f)^(b)} represents a reconstruction coefficient of the foreground pixel X̂_(f) corresponding to the dictionary D, Z_(b)={Z_(b)^(f),Z_(b)^(b)} represents a reconstruction coefficient of the background pixel X̂_(b) corresponding to the dictionary D, and {Z_(i)^(j)|i,j=f,b} represents a reconstruction coefficient of a known point X̂_(i) corresponding to the dictionary D_(j).
 5. The method according to claim 4, wherein, the obtaining the reconstruction coefficient of the input video corresponding to the dictionary according to the dictionary, specifically comprises: obtaining the reconstruction coefficient of the input video corresponding to the dictionary by minimizing following energy equation (2): $\begin{matrix}{\min\sum\limits_{i}^{n}\left( \left\| X_{i} - DW_{i} \right\|_{0} + \left\| W_{i} \right\|_{0} \right) + \left\| W \right\|_{*},\quad \text{s.t.}\; \forall p,q,\; (w_{i})_{p,q} \in W_{i},\; (w_{i})_{p,q} \geq 0.} & (2)\end{matrix}$ wherein, X={X₁, . . . , X_(n)}, n represents that there are n frames in the input video, X_(i) represents RGBXY characteristics of the i^(th) frame, ∥•∥_(*) represents a nuclear norm, which is the sum of singular values of the matrix and is used for restraining the reconstruction coefficient to be low-rank, ∥•∥₀ represents a zero norm, which is a number of nonzero elements and is used for restraining the reconstruction coefficient to be sparse, $W = \left\{ W_{1}, \ldots, W_{n} \right\} = \begin{pmatrix} (w_{1})_{1,1} & \cdots & (w_{i})_{1,p} & \cdots & (w_{n})_{1,m} \\ \vdots & \ddots & \vdots & & \vdots \\ (w_{1})_{q,1} & \cdots & (w_{i})_{q,p} & \cdots & (w_{n})_{q,m} \\ \vdots & & \vdots & \ddots & \vdots \\ (w_{1})_{t,1} & \cdots & (w_{i})_{t,p} & \cdots & (w_{n})_{t,m} \end{pmatrix},$ m represents that a number of pixels in each frame is m, t represents that a number of sample points in the dictionary D is t, (w_(i))_(q,p) represents the reconstruction coefficient of the p^(th) pixel in the i^(th) frame corresponding to the q^(th) sample point in the dictionary.
 6. The method according to claim 5, wherein, the setting the non-local relationship matrix between each pixel in the input video according to the reconstruction coefficient, specifically comprises: setting the non-local relationship matrix according to formula (3): $\begin{matrix}{\min\sum\limits_{i}^{n}\sum\limits_{j}^{m}\left( \alpha_{ij} - \alpha_{D}w_{ij} \right)^{2}} & (3)\end{matrix}$ wherein, α_(ij) represents the α value of the j^(th) pixel in the i^(th) frame, m represents the number of pixels in each frame, α_(D)={α_(f),α_(b)} represents α values of all sample points in the dictionary D, α_(f)=1 represents α values of sample points in the foreground dictionary, α_(b)=0 represents α values of sample points in the background dictionary, and w_(ij)=[(w_(i))_(1,j), . . . , (w_(i))_(t,j)] represents the reconstruction coefficient of the j^(th) pixel in the i^(th) frame corresponding to the dictionary D.
 7. The method according to claim 6, wherein, the setting the Laplace matrix between multiple frames, specifically comprises: setting a Laplace matrix between multiple frames according to formula (4): $\begin{matrix}{W_{ij}^{mlap} = \delta\sum\limits_{k}^{(i,j) \in c_{k}}\frac{1 + \left( C_{i} - \mu_{k} \right)\left( \Sigma_{k} + \frac{\epsilon}{d \times m^{2}}I \right)^{-1}\left( C_{j} - \mu_{k} \right)}{d \times m^{2}}} & (4)\end{matrix}$ wherein, W_(ij)^(mlap) represents a Laplace matrix, δ controls intensity of local smoothness, k represents a number of the windows in one frame, c_(k) represents the k^(th) window, C_(i) represents a color value of the i^(th) pixel, μ_(k) and Σ_(k) represent mean and variance of the color in the window respectively, ε represents a normal coefficient, d×m² is a size of the window which represents selecting neighboring d frames and each frame selects pixels in m² window as neighbors, and I represents an identity matrix.
 8. The method according to claim 7, wherein the normal coefficient ε is set as 10⁻⁵, m as 3, and d as 2.
 9. The method according to claim 7, wherein, the obtaining the video alpha matte of the input video, according to α values of the known pixels of the input video and α values of sample points in the dictionary, the non-local relationship matrix and the Laplace matrix, specifically comprises: obtaining α values of the unknown pixels of the input video according to formula (5): $\begin{matrix}{E = \lambda\sum\limits_{s \in S}\left( \alpha_{s} - g_{s} \right)^{2} + \sum\limits_{i = 1}^{n}\sum\limits_{j = 1}^{m}\left( \alpha_{ij} - \alpha_{D}w_{ij} \right)^{2} + \sum\limits_{i = 1}^{n}\sum\limits_{j = 1}^{m}\left( \sum\limits_{k \in N_{j}}W_{jk}^{mlap}\left( \alpha_{ij} - \alpha_{k} \right) \right)^{2}} & (5)\end{matrix}$ wherein, S represents a set constructed by α values of the known pixels of the input video and α values of sample points in the dictionary, N_(j) is adjacent points of pixel j in d×m² window, g_(s)=1 represents pixel s in set S is a foreground pixel, and g_(s)=0 represents pixel s in set S is a background pixel; and obtaining the video alpha matte of the input video according to α values of the known pixels and α values of the unknown pixels of the input video.